
    Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

    We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions, and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model outperforms traditional hybrid ASR systems. Comment: Accepted for INTERSPEECH 201
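    The joint decoding described above can be sketched as a weighted log-linear combination of the three scores for each beam hypothesis. The weight names and values below are illustrative assumptions, not the paper's tuned settings:

    ```python
    def joint_score(log_p_ctc, log_p_att, log_p_lm,
                    ctc_weight=0.3, lm_weight=0.5):
        """Combine CTC, attention-decoder, and LM log-probabilities
        for one hypothesis, as in joint CTC-attention decoding."""
        return (ctc_weight * log_p_ctc
                + (1.0 - ctc_weight) * log_p_att
                + lm_weight * log_p_lm)

    def rank_beam(hypotheses, ctc_weight=0.3, lm_weight=0.5):
        """Rank hypotheses, given as (text, log_p_ctc, log_p_att, log_p_lm),
        by their combined score, best first."""
        scored = [(text, joint_score(c, a, l, ctc_weight, lm_weight))
                  for text, c, a, l in hypotheses]
        return sorted(scored, key=lambda x: x[1], reverse=True)
    ```

    In practice these scores would come from the CTC head, the attention decoder, and the LSTM LM at each beam-expansion step; the sketch shows only how they are fused.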

    Real-Time One-Pass Decoding with Recurrent Neural Network Language Model for Speech Recognition

    This paper proposes an efficient one-pass decoding method for real-time speech recognition employing a recurrent neural network language model (RNNLM). An RNNLM is an effective language model that yields a large gain in recognition accuracy when it is combined with a standard n-gram model. However, since every word probability distribution based on an RNNLM depends on the entire history from the beginning of the speech, the search space in Viterbi decoding grows exponentially with the length of the recognition hypotheses, making computation prohibitively expensive. Therefore, an RNNLM is usually used via N-best rescoring or by approximating it with a back-off n-gram model. In this paper, we present another approach that enables one-pass Viterbi decoding with an RNNLM without approximation, where the RNNLM is represented as a prefix tree of possible word sequences, and only the part needed for decoding is generated on the fly and used to rescore each hypothesis using an on-the-fly composition technique we previously proposed. Experimental results on the MIT lecture transcription task show that our proposed method enables one-pass decoding with small overhead for the RNNLM and achieves slightly higher accuracy than 1000-best rescoring. Furthermore, it reduces the latency from the end of each utterance in two-pass decoding by a factor of 10. Index Terms: Speech recognition, Recurrent neural network language model, Weighted finite-state transducer, On-the-fly rescoring
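    The key idea above, scoring hypotheses against an RNNLM while expanding only the prefixes the decoder actually visits, can be illustrated with a small caching sketch. Here `lm_score_fn` is a hypothetical stand-in for one RNNLM forward step; the cache plays the role of the lazily built prefix tree, so hypotheses that share a prefix pay for each LM step only once:

    ```python
    class OnTheFlyRescorer:
        """Sketch of on-the-fly RNNLM rescoring over shared prefixes.

        lm_score_fn(history, word) -> log probability of `word` given the
        word-tuple `history`. Results are cached per (history, word), so
        prefixes shared across hypotheses are computed only once.
        """

        def __init__(self, lm_score_fn):
            self.lm_score_fn = lm_score_fn
            self.cache = {}   # (history, word) -> log prob
            self.calls = 0    # number of actual LM evaluations

        def score(self, history, word):
            key = (history, word)
            if key not in self.cache:
                self.calls += 1
                self.cache[key] = self.lm_score_fn(history, word)
            return self.cache[key]

        def score_sequence(self, words):
            """Total LM log probability of a word sequence."""
            total, history = 0.0, ()
            for w in words:
                total += self.score(history, w)
                history += (w,)
            return total
    ```

    Scoring two hypotheses that share a two-word prefix triggers only four LM evaluations instead of six, which is the source of the small per-hypothesis overhead reported in the paper. A real implementation would also cache the RNNLM hidden state per prefix rather than recomputing from the tuple history.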